Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix matmul incorrect results when k dim for CTA tile is a multiple of 16 #3616

Draft
wants to merge 16 commits into
base: main
Choose a base branch
from

Conversation

rdspring1
Copy link
Collaborator

@rdspring1 rdspring1 commented Dec 19, 2024

This PR fixes the incorrect results issue when k dimension for CTA tile is a multiple of getK(mma_macro).

Why?

  • In scheduleMmaResults, we need to split the k reduction by getK(mma_macro). A serial reduction will add the results from wgmma along k-dimension.

Details

  • Modified transformLikeMmaOutput function to not be used in scheduleMmaResults.
  • Add HSH_TN_UseScheduler test

Performance Results

HSH_TN_UseScheduler - 1.038 - Basically a tie

CTA tile is M=64, K=256, K=32

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     27.5           466137          1  466137.0  466137.0    466137    466137          0.0  nvjet_hsh_128x256_64x4_2x1_v_bz_coopA_TNN                                                           
     26.5           448985          1  448985.0  448985.0    448985    448985          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…

HSH_NT_UseScheduler - 0.9155

CTA tile is M=64, K=256, K=32

 Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)                                                  Name                                                
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  ----------------------------------------------------------------------------------------------------
     27.7           468281          1  468281.0  468281.0    468281    468281          0.0  <unnamed>::nvfuser_none_f0_c0_r0_g0(<unnamed>::Tensor<<unnamed>::__half, (int)3, (int)3>, <unnamed>…
     25.4           428729          1  428729.0  428729.0    428729    428729          0.0  nvjet_hsh_128x256_64x4_2x1_v_bz_coopA_NTN   

@rdspring1
Copy link
Collaborator Author

!test

Comment on lines 4036 to 4037
// NOTE Certain combinations of cta k dimension and circular buffer
// prefetching can get incorrect results.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👀


mparams.supported_vec_size = {8, 8, 4};
mparams.supported_vec_size = {8, 8, 8};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, although I think it currently has no meaning until we start handling epilogue inputs with supported vec size.

@rdspring1
Copy link
Collaborator Author

!test

@rdspring1 rdspring1 force-pushed the hopper_matmul_cta_k_fix branch 3 times, most recently from 952cb0a to 0e213d6 Compare December 20, 2024 23:36
@rdspring1 rdspring1 force-pushed the hopper_matmul_cta_k_fix branch from d366352 to 0c784e0 Compare December 20, 2024 23:56
@rdspring1
Copy link
Collaborator Author

!test

jacobhinkle added a commit that referenced this pull request Jan 2, 2025
I think this covers the motivation for #3616
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants